- Everitt and Hothorn (2011) An Introduction to Applied Multivariate Analysis with R. Springer.
- Tabachnick and Fidell (2013): Using Multivariate Statistics. Pearson.
Multivariate data is not just a challenge for visualization.
You have too many variables
Your number of variables approaches your number of observations.
A linear combination of variables
Multiple Regression Makes a Composite Variable
Goals:
No predictor or outcome variable. Think of it as a one sided equation.
For q variables, you get q PCs
Find the indices of the minimum and maximum of x1
themin <- which.min(M$Fat) themax <- which.max(M$Fat)
Make a data.frame with the points for plotting
minpt <- data.frame(x1 = M$Fat[themin], x2 = M$Lactose[themin]) maxpt <- data.frame(x1 = M$Fat[themax], x2 = M$Lactose[themax])
Green = Low Fat, High Lactose
Blue = High Fat, Low Lactose
Green = Low Fat, High Lactose
Blue = High Fat, Low Lactose
Green = Low Fat, High Lactose
Blue = High Fat, Low Lactose
Green = Low Fat, High Lactose
Blue = High Fat, Low Lactose
Green = Low Fat, High Lactose
Blue = High Fat, Low Lactose
prcomp() is the preferred function for PCA (not princomp()):
z <- prcomp(~ Fat + Lactose, data = M, center = TRUE, scale. = TRUE)
~ Fat + LactosePackages:
ade4: https://cran.r-project.org/web/packages/ade4/index.htmlFactoMineR: https://cran.r-project.org/web/packages/FactoMineR/index.htmlWe will use the built-in prcomp().
prcomp objectsstr(z)
## List of 6 ## $ sdev : num [1:2] 1.393 0.242 ## $ rotation: num [1:2, 1:2] 0.707 -0.707 -0.707 -0.707 ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:2] "Fat" "Lactose" ## .. ..$ : chr [1:2] "PC1" "PC2" ## $ center : Named num [1:2] 34 49.6 ## ..- attr(*, "names")= chr [1:2] "Fat" "Lactose" ## $ scale : Named num [1:2] 14.3 14.1 ## ..- attr(*, "names")= chr [1:2] "Fat" "Lactose" ## $ x : num [1:29, 1:2] -1.785 -1.444 -1.962 -2.066 -0.514 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:29] "1" "2" "3" "4" ... ## .. ..$ : chr [1:2] "PC1" "PC2" ## $ call : language prcomp(formula = ~Fat + Lactose, data = M, center = TRUE, scale. = TRUE) ## - attr(*, "class")= chr "prcomp"
x is a matrix of the the principal components
summary(z)
## Importance of components: ## PC1 PC2 ## Standard deviation 1.3934 0.24158 ## Proportion of Variance 0.9708 0.02918 ## Cumulative Proportion 0.9708 1.00000
PC1 explains 97% of the variance in the data.
Correlations with with composite variable
print(z)
## Standard deviations (1, .., p=2): ## [1] 1.3934265 0.2415836 ## ## Rotation (n x k) = (2 x 2): ## PC1 PC2 ## Fat 0.7071068 -0.7071068 ## Lactose -0.7071068 -0.7071068
Lactose loads negatively on PC1 and PC2Fat loads positively on PC1 and negatively on PC2PC <- data.frame(pc1 = z$x[, 1],
pc2 = z$x[, 2])
PC
## pc1 pc2 ## 1 -1.78509431 -0.063653225 ## 2 -1.44365796 0.013484299 ## 3 -1.96166269 0.006259316 ## 4 -2.06645523 -0.177723875 ## 5 -0.51393051 0.150315276 ## 6 -0.91347766 0.350637364 ## 7 -0.07717306 0.351480511 ## 8 1.90779088 -0.014528235 ## 9 1.55556364 0.358828937 ## 10 1.95386703 0.311684874 ## 11 0.95822267 0.229702746 ## 12 -2.55254845 0.423074692 ## 13 0.62357110 0.189046494 ## 14 0.53522693 0.247205046 ## 15 0.42873297 -0.083992416 ## 16 0.69910152 0.089367582 ## 17 1.73099540 -0.155687913 ## 18 1.98506466 -0.145129935 ## 19 1.36259231 -0.113289467 ## 20 1.75461984 -0.256788764 ## 21 -1.99486571 -0.210072095 ## 22 -0.14099491 -0.192434707 ## 23 -0.90363007 -0.154254327 ## 24 -0.07624607 -0.451377680 ## 25 -0.68540391 -0.011259289 ## 26 -0.11968797 0.216906468 ## 27 -1.51615118 -0.248076624 ## 28 -0.06089007 -0.342972655 ## 29 1.31652081 -0.316752397
Green = Low Fat, High Lactose
Blue = High Fat, Low Lactose
## pc1 pc2 ## 1 -1.785094 -0.06365323
biplot()biplot(z)
factoextra for more handy functionsinstall.packages("devtools")
library(devtools)
install_github("kassambara/factoextra")
library(factoextra)
get_eig(z)
## eigenvalue variance.percent cumulative.variance.percent ## Dim.1 1.94163735 97.081867 97.08187 ## Dim.2 0.05836265 2.918133 100.00000
Eigenvalues are variances. Default R print() method returns standard deviations.
fviz_eig(z, addlabels = TRUE)
fviz_pca_var(z) + coord_fixed()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
fviz_pca_ind(z, geom = "text")
PC1 <- z$x[, 1] summary(lm(Milk_Energy ~ PC1, data = M))
## ## Call: ## lm(formula = Milk_Energy ~ PC1, data = M) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.10569 -0.07164 0.01139 0.04552 0.11731 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.641724 0.012138 52.87 < 2e-16 *** ## PC1 0.106277 0.008865 11.99 2.54e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.06537 on 27 degrees of freedom ## Multiple R-squared: 0.8418, Adjusted R-squared: 0.836 ## F-statistic: 143.7 on 1 and 27 DF, p-value: 2.539e-12
fm_Multi <- lm(Milk_Energy ~ Fat + Lactose, data = M) summary(fm_Multi)
## ## Call: ## lm(formula = Milk_Energy ~ Fat + Lactose, data = M) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.11350 -0.05047 0.01103 0.04649 0.12701 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.007372 0.211168 4.770 6.16e-05 *** ## Fat 0.001952 0.002533 0.771 0.44784 ## Lactose -0.008709 0.002575 -3.382 0.00229 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.06447 on 26 degrees of freedom ## Multiple R-squared: 0.8518, Adjusted R-squared: 0.8404 ## F-statistic: 74.74 on 2 and 26 DF, p-value: 1.657e-11
M <- read_excel("../data/mammals.xlsx", na = "NA")
M <- M %>% select(litter_size,
adult_body_mass_g,
neonate_body_mass_g,
max_longevity_m,
sexual_maturity_age_d)
names(M) <- c("Litter_Size", "Adult_Mass",
"Neonate_Mass", "Longevity", "Maturity_Age")
M <- M %>% drop_na()
~ . means all columns
z <- prcomp(~ .,
data = M,
center = TRUE,
scale. = TRUE)
get_eig(z)
## eigenvalue variance.percent cumulative.variance.percent ## Dim.1 2.6828230 53.656461 53.65646 ## Dim.2 1.4775239 29.550478 83.20694 ## Dim.3 0.6218664 12.437328 95.64427 ## Dim.4 0.1541320 3.082640 98.72691 ## Dim.5 0.0636547 1.273094 100.00000
print(z)
## Standard deviations (1, .., p=5): ## [1] 1.6379325 1.2155344 0.7885851 0.3925965 0.2522988 ## ## Rotation (n x k) = (5 x 5): ## PC1 PC2 PC3 PC4 PC5 ## Litter_Size 0.3029675 -0.4765735 0.8184362 0.09980480 0.03591135 ## Adult_Mass -0.4462968 -0.5301452 -0.1399632 -0.26489329 0.65574932 ## Neonate_Mass -0.4761971 -0.4894702 -0.0803547 0.02829269 -0.72553281 ## Longevity -0.5400521 0.2503846 0.2471013 0.74134502 0.18708228 ## Maturity_Age -0.4365890 0.4353737 0.4930076 -0.60784672 -0.08547258
fviz_eig(z, addlabels = TRUE)
fviz_pca_var(z) + coord_fixed()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
fviz_pca_var(z, axes = c(2, 3)) + coord_fixed()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
M <- read_excel("../data/Trilobites.xlsx")
M
## # A tibble: 20 x 4 ## Genus Body_Length Glabella_Length Glabella_Width ## <chr> <dbl> <dbl> <dbl> ## 1 Acaste 23.1 3.5 3.77 ## 2 Balizoma 14.3 3.97 4.08 ## 3 Calymene 51.7 10.9 10.7 ## 4 Ceraurus 21.2 4.9 4.69 ## 5 Cheirurus 31.7 9.33 12.1 ## 6 Cybantyx 36.8 11.4 10.1 ## 7 Cybeloides 25.1 6.39 6.81 ## 8 Dalmanites 32.9 8.46 6.08 ## 9 Deiphon 21.8 6.92 9.01 ## 10 Ormathops 13.9 5.03 4.34 ## 11 Phacopidina 21.4 7.03 6.79 ## 12 Phacops 27.2 5.3 8.19 ## 13 Placopania 38.2 9.4 8.71 ## 14 Pricyclopyge 40.1 15.0 13.0 ## 15 Ptychoparia 62.2 12.2 8.71 ## 16 Rhenops 55.9 19 13.1 ## 17 Sphaerexochus 23.3 3.84 4.6 ## 18 Toxochasmops 46.1 8.15 11.4 ## 19 Trimerus 89.4 23.2 21.5 ## 20 Zacanthoides 47.9 13.6 11.8
z <- prcomp(~ .,
data = M,
scale. = TRUE,
center = TRUE)
summary(z)
## Importance of components: ## PC1 PC2 PC3 ## Standard deviation 1.6662 0.37627 0.2867 ## Proportion of Variance 0.9254 0.04719 0.0274 ## Cumulative Proportion 0.9254 0.97260 1.0000
## Standard deviations (1, .., p=3): ## [1] 1.6662015 0.3762685 0.2866962 ## ## Rotation (n x k) = (3 x 3): ## PC1 PC2 PC3 ## Body_Length 0.5727476 -0.7572887 0.3138058 ## Glabella_Length 0.5834596 0.1077154 -0.8049673 ## Glabella_Width 0.5757909 0.6441360 0.5035411
Suggested sample sizes vary:
All data:
PC1:
Everitt, B., and T. Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer, New York.
Tabachnick, B. G., and L. S. Fidell. 2013. Using Multivariate Statistics. Pearson Education, Boston.